NONUNIVERSAL POWER LAW SCALING IN THE PROBABILITY 
DISTRIBUTION OF SCIENTIFIC CITATIONS 
Abstract. We develop a model for the distribution of scientific citations. The model involves a dual 
mechanism: in the direct mechanism, the author of a new paper finds an old paper A and cites it. In 
the indirect mechanism, the author of a new paper finds an old paper A only via the reference list of a 
newer intermediary paper B, which has previously cited A. By comparison to citation databases, we find 
that papers having few citations are cited mainly by the direct mechanism. Papers already having many 
citations ('classics') are cited mainly by the indirect mechanism. The indirect mechanism gives a power-law 
tail. The 'tipping point' at which a paper becomes a classic is about 21 citations for papers published in the 
Institute for Scientific Information (ISI) Web of Science database in 1981, 29 for Physical Review D papers 
published from 1975-1994, and 39 for all publications from a list of high h-index chemists assembled in 2007. 
The power-law exponent is not universal. Individuals who are highly cited have a systematically smaller 
exponent than individuals who are less cited. 

Probability distribution functions that appear to involve dual
underlying mechanisms, with a 'tipping point' between them, are
commonly observed in nature and in the social sciences. Examples
include the distributions of city sizes [1, 2]; fluctuations in stock
market indices [3, 4]; U.S. firm sizes [5, 6]; degrees of Internet
nodes [7, 8]; numbers of followers of religions [8]; gamma-ray
intensities of solar flares [9]; sightings of bird species [8]; and
citations of scientific papers [10, 11, 12, 13]. In these situations, a
distribution p(k) may have exponential behavior for small k and a
power-law tail for large k. Here we develop a generative model for one
such dual-mechanism process, scientific citations, for which databases
are large and readily available. Here, k represents the number of
citations a paper receives, ranging from 0 to hundreds or, sometimes,
thousands, and p(k) is the distribution of the relative numbers of such
citations, taken over a database of papers.

There have been several important studies of power-law tails of
distributions, including those involving scientific citations. Price
noted that highly cited scientific papers accumulate additional
citations more quickly than papers that have fewer citations [14]. He
called this 'cumulative advantage' (CA): the probability that a paper
receives a citation is proportional to the number of citations it
already has. Price showed that this rule asymptotically gives a power
law for large k. Power-law tails have been widely explored in various
contexts and under different names: 'the rich get richer', the Yule
process [15, 16], the Matthew effect [17], or preferential attachment
[18]. Barabási and Albert noted that networks, such as the World Wide
Web, often have power-law distributions of vertex connectivities,
called 'scale-free' behavior [18]. Their model, called preferential
attachment, leads to a fixed power-law exponent of -3. Because many
properties of physical systems near their critical points also display
power-law behavior, and because such exponents are often universal
(i.e., independent of microscopic particulars of the system), the
question arises of which power-law distributions have universal
exponents and which do not.

The tail of the scientific citations distribution has been fit by
various distributions, including power law [10, 19], log-normal [20],
and stretched exponential [21]. Recently, Clauset, Shalizi, and Newman
proposed detailed statistical tests for determining whether various
data sets have true power-law tails [8]. In agreement with Redner's
earlier analysis [10], Clauset et al. confirm that the 1981 data set
studied by Redner is indeed well fit by a power law.

Our interest here is not just in the large-k tails of such
distribution functions. We are interested also in the small-k behavior
and the tipping point between the two different regions. After all, the
great majority of scientific papers are not cited often. Some previous
models have explored both the small-k and large-k regimes of citations.
In 2001, Krapivsky and Redner developed a rate equation method to
obtain solutions for several generalizations of the CA model, including
results for nonlinear connection probabilities [22]. Krapivsky and
Redner proposed a 'growing network with redirection' (GNR) for the
citations network. In this model, a new paper either cites a randomly
chosen existing paper or is redirected to one of the papers in that
paper's reference list. The GNR mechanism leads to a distribution with
a non-universal scaling exponent, depending on the value of the
redirection parameter. An analysis of this mechanism for arbitrary
out-degree distribution was carried out by Rozenfeld and ben-Avraham
[23]. Recently, Walker et al. proposed a redirection algorithm to rank
traffic to individual papers, which, instead of an initial random
attachment probability, used an exponentially decaying probability of
citation, according to the age of the paper [24]. Many variations of
the basic CA model have been proposed, including CA with error
tolerance [25], with an attractiveness parameter [26], with a fitness
parameter [27], with memory effects [28], with hierarchical
organization [29], with aging nodes [30], and a number of others. A
useful overview of CA models, and power laws in general, is given by
Newman [9].

Here, we develop a model to address three points of particular
interest to us. First, existing models focus on the power-law tail. We
are interested here in the full distribution function and the nature of
the transition, or the 'tipping point,' from one mechanism to the
other. Second, we seek a mechanism that illuminates why the 'rich get
richer' in scientific citations. Third, a strictly linear attachment
rule predicts a single fixed exponent, $\gamma = 3$, where
$p(k) \propto k^{-\gamma}$. Here, we ask whether the power-law exponent
for scientific citations is a universal constant, as is often observed
in the physics of critical phenomena, or whether the power-law exponent
for citations is a non-universal parameter which varies from one
dataset to another.



The two-mechanism model we propose here is similar to the GNR model
studied in [22], generalized for an out-degree greater than one. A
general treatment of the GNR model with arbitrary out-degree
distribution was given in [23]. Here, we derive p(k) explicitly for the
specific case of a fixed out-degree, and analyze the 'tipping point'
transition between the two mechanisms. We then fit our p(k) to several
citations datasets, and examine how the interactions between the two
mechanisms produce different distributions (with different tipping
points) for each dataset. By sorting our datasets according to h-index,
we show that the scaling exponent, $\gamma$, decreases systematically
with increasing values of h. We interpret the changes in the scaling
exponent using a parameter of our model as an increasing bias towards
indirect citation of well-known scientists.

1. A Two-Mechanism Model 

Consider a directed graph on which each node represents a scientific
paper. Each edge represents a citation of one paper by another. An
outgoing edge indicates giving a citation, and an incoming edge
indicates receiving a citation. At a given time, the graph has N nodes,
representing old papers that are already part of the graph. At each
time step, a new paper is published (a node is added to the graph).
Each new paper gives a fixed number of citations, n, distributed among
the N old papers. Hence the total number of citations given is Nn, and
the total number of citations received is also Nn. In general, we
consider situations in which N is large. Let k be the number of
incoming links (citations) that a paper has received. For example, a
paper that has received no citations from other papers has k = 0. Some
'classic' papers have attracted more than k = 1000 citations. A given
collection of papers will have a distribution, p(k), of papers that
have received k = 0, 1, 2, . . . citations.

We first focus on a particular old paper, paper A. The probability
that a new paper will randomly link to paper A is

$$ r_{\mathrm{direct}} = \frac{1}{N}. \tag{1} $$

We call Equation 1 the direct mechanism of citations (footnote 1).

1 Because each new paper will not cite an old paper more than once, the direct probability (Eq. 1) of the first citation is
1/N, of the second citation is $1/(N-1)$, and so on, and of the nth citation is $1/(N-n+1)$. For real-world graphs, however,
N is of the order of 500,000 and n is around 20. So, we assume $N \gg n$, and $1/(N-n+1) \approx 1/N$. Similarly, as
$Nn \gg n$, the indirect probability (Eq. 2) is approximately $k/(Nn-n+1) \approx k/(Nn)$. Note also that, perhaps
unrealistically, no special weight is given to the possibility of simultaneously citing both paper A and one of its references.

Figure 1. Probability of receiving exactly k citations (PDF) and at least k citations (CDF,
inset) for datasets 1 (left), 2 (center), and 3 (right). Empirical data points are shown as
blue diamonds, and best-fit curves as solid red lines.

In addition, scientific papers are also cited by an indirect
mechanism: the author of the new paper may first find a paper B and
learn of paper A via B's reference list. On the citation graph,
searching through B's reference list is a nearest-neighbor-link
mechanism. Suppose there are already k incoming links to paper A.
Because there are a total of nN incoming links to all papers, the
probability that the author of the new paper randomly finds paper A via
the reference list of some other paper is

$$ r_{\mathrm{indirect}}(k) = \frac{k}{Nn}. \tag{2} $$

Given that the author of the new paper has found old paper A, the
author will either cite a paper from A's reference list with
probability c, or cite A itself with probability $1 - c$. If paper A
currently has k citations, then the expected number of citations, R(k),
that paper A receives from a new paper, through either the direct or
indirect mechanism, is

$$ R(k) = n\left[(1-c)\,r_{\mathrm{direct}} + c\,r_{\mathrm{indirect}}(k)\right] = \frac{n(1-c)}{N} + \frac{kc}{N}. \tag{3} $$
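The growth rule described above can be explored numerically. The following Python sketch is a minimal illustration under stated assumptions, not the authors' implementation: the function name `simulate_citations`, the seed graph of n + 1 papers that cite nothing, and the choice to re-draw duplicate targets are all introduced here for illustration. Each new paper repeatedly finds an old paper A uniformly at random and then cites A itself with probability 1 - c or, with probability c, a paper drawn from A's reference list.

```python
import random


def simulate_citations(num_papers, n, c, seed=0):
    """Grow a citation graph; return citations received by each paper.

    Assumes 0 <= c < 1 and num_papers > n + 1 (hypothetical helper).
    """
    rng = random.Random(seed)
    # Assumption: start from n + 1 seed papers with empty reference lists.
    refs = [[] for _ in range(n + 1)]   # refs[i] = papers cited by paper i
    in_degree = [0] * (n + 1)
    for new_paper in range(n + 1, num_papers):
        cited = set()
        while len(cited) < n:
            found = rng.randrange(new_paper)        # find an old paper A at random
            target = found                          # direct: cite A itself
            if refs[found] and rng.random() < c:    # indirect: cite from A's list
                target = rng.choice(refs[found])
            cited.add(target)                       # duplicates are simply re-drawn
        refs.append(sorted(cited))
        in_degree.append(0)
        for paper in cited:
            in_degree[paper] += 1
    return in_degree
```

A histogram of the returned list gives an empirical estimate of p(k) that can be compared with the analytical distribution derived in this section; for small graphs the seed papers and finite-size effects are expected to distort the comparison.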

Next, we compute the in-link distribution p(k), the fraction of the N
papers that have k incoming citations. The total number of papers
having k citations is Np(k) (footnote 2). We calculate p(k) using a
difference equation to express the flows into and out of the bin of
papers having k citations for each time step (each time a new node is
added). The population of the bin of papers with k citations increases
every time a paper with $k - 1$ citations receives another citation and
decreases every time a paper that already has k citations receives
another citation,

$$ p(k) = N\left[R(k-1)\,p(k-1) - R(k)\,p(k)\right] = \left[n(1-c) + c(k-1)\right]p(k-1) - \left[n(1-c) + ck\right]p(k). \tag{4} $$

Equation 4 rearranges to:

$$ p(k) = \frac{a - 1 + k}{a + 1/c + k}\, p(k-1), \tag{5} $$

where, to simplify the notation, we have defined

$$ a = \frac{n}{c} - n. \tag{6} $$



The equation for p(0) involves no inflow from a lesser bin. Instead,
the inflow comes from the addition of a new paper per time step, which
is 1 by definition. The outflow term is calculated as for other values
of k. Therefore, $p(0) = 1 - n(1-c)\,p(0)$, which rearranges to:

$$ p(0) = \frac{1}{n - nc + 1} = \frac{1}{ac + 1}. \tag{7} $$

Substituting in Equation 7 and applying Equation 5 recursively gives
(footnote 3)

$$ p(k) = \frac{1}{ac + 1}\, \frac{(a - 1 + k)!\,(a + 1/c)!}{(a - 1)!\,(a + 1/c + k)!}. \tag{8} $$
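As a numerical cross-check of the derivation, the recursion of Equations 5 and 7 can be compared with the closed form of Equation 8. The sketch below is illustrative only; the function names `p_recursive` and `p_closed` are hypothetical, and the factorials are evaluated as gamma functions through `math.lgamma`, using x! = Gamma(x + 1).

```python
import math


def p_recursive(k_max, n, c):
    """Return [p(0), ..., p(k_max)] from Equation 7 and the recursion in Equation 5."""
    a = n * (1.0 - c) / c               # Equation 6: a = n/c - n
    p = [1.0 / (a * c + 1.0)]           # Equation 7
    for k in range(1, k_max + 1):
        p.append(p[-1] * (a - 1.0 + k) / (a + 1.0 / c + k))
    return p


def p_closed(k, n, c):
    """Evaluate Equation 8 in log space to avoid overflow of the factorials."""
    a = n * (1.0 - c) / c
    log_p = (math.lgamma(a + k) - math.lgamma(a)
             + math.lgamma(a + 1.0 / c + 1.0)
             - math.lgamma(a + 1.0 / c + 1.0 + k)
             - math.log(a * c + 1.0))
    return math.exp(log_p)


if __name__ == "__main__":
    # Dataset 1 parameters from Table 1 (n = 17.3, c = 0.454).
    values = p_recursive(20000, 17.3, 0.454)
    print("p(0) =", values[0], " closed form:", p_closed(0, 17.3, 0.454))
    print("partial sum of p(k) up to k = 20000:", sum(values))
```

The partial sum approaches 1 as the cutoff grows, which is the normalization property stated in the footnote to Equation 8.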



2 The in-link distribution should be considered a function of both k and N, p(k, N). However, we find that in the large-N
limit, the difference between p(k, N) and p(k, N - 1) decreases as 1/N. It is therefore vanishingly small for very large N, and
$\lim_{N \to \infty} p(k, N) = p(k)$.

3 The factorials in Equation 8 are understood to be gamma functions for non-integer 1/c values. To show that Equation 8 is
normalized, we use

$$ \sum_{k=0}^{\infty} \frac{(a - 1 + k)!}{(a + 1/c + k)!} = (ac + 1)\, \frac{(a - 1)!}{(a + 1/c)!}. $$

Substituting into Equation 8 we find that $\sum_{k=0}^{\infty} p(k) = 1$, as required.

Table 1. Fitting parameters for datasets 1-3

Dataset                        c                n             $\gamma$         a             N
1. All 1981 publications       0.454 ± 0.004    17.3 ± 0.3    3.20 ± 0.02      20.8 ± 0.4    415229
2. High h-index chemists       0.517 ± 0.001    42.0 ± 0.1    2.935 ± 0.005    39.2 ± 0.1    245461
3. Phys. Rev. D publications   0.48 ± 0.03      27 ± 2        3.1 ± 0.1        29 ± 3        5327



When a is sufficiently large, we apply Stirling's approximation to
Equation 8, which yields

$$ p(k) \approx \frac{(a + 1/c)^{a + 1/c}}{(ac + 1)\,(a - 1)^{a - 1}} \left[1 - \frac{1 + 1/c}{a + 1/c + k}\right]^{a + k} (a - 1 + k)^{-1}\, (a + 1/c + k)^{-1/c}. \tag{9} $$



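In the large-k tail ($k \gg a$), we have

$$ \left[1 - \frac{1 + 1/c}{a + 1/c + k}\right]^{a + k} \approx e^{-(1 + 1/c)} \quad \text{and} \quad (a - 1 + k)^{-1}\,(a + 1/c + k)^{-1/c} \approx k^{-(1 + 1/c)}. $$

Therefore, Equation 9 becomes, in the large-k tail:

$$ p(k) \approx \frac{(a + 1/c)^{a + 1/c}\, e^{-(1 + 1/c)}}{(ac + 1)\,(a - 1)^{a - 1}}\; k^{-(1 + 1/c)}. \tag{10} $$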
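The approach to power-law scaling can be checked numerically without relying on the Stirling prefactor. The sketch below is an illustration, with hypothetical function names `log_p` and `local_slope`: it evaluates the exact distribution of Equation 8 in log space and estimates the local log-log slope between k and 2k, which should approach -(1 + 1/c) when k is much larger than a and be much shallower for small k.

```python
import math


def log_p(k, n, c):
    """Natural log of p(k) from Equation 8, with x! evaluated as Gamma(x + 1)."""
    a = n * (1.0 - c) / c
    return (math.lgamma(a + k) - math.lgamma(a)
            + math.lgamma(a + 1.0 / c + 1.0)
            - math.lgamma(a + 1.0 / c + 1.0 + k)
            - math.log(a * c + 1.0))


def local_slope(k, n, c):
    """Slope of log p versus log k, estimated between k and 2k."""
    return (log_p(2 * k, n, c) - log_p(k, n, c)) / math.log(2.0)


if __name__ == "__main__":
    n, c = 17.3, 0.454                  # dataset 1 parameters from Table 1
    for k in (5, 50, 500, 5000, 50000):
        print(k, local_slope(k, n, c))
    print("asymptotic slope:", -(1.0 + 1.0 / c))
```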

Equation 9 gives our model's prediction for the distribution of
citations. It expresses both the direct and indirect citation
mechanisms. Equation 10 indicates that once a paper's number of
citations, k, is large enough, further citations of that paper undergo
a sort of runaway growth because there are so many ways to find it
through other papers that have already cited it; for scientific
citations, 'the rich get richer.' The 'tipping point,' where the
indirect term in Equation 3, $c\,r_{\mathrm{indirect}}(k) = ck/(Nn)$,
overtakes the direct term, $(1-c)\,r_{\mathrm{direct}} = (1-c)/N$,
happens at

$$ k = a. \tag{11} $$
For example, if c = 1/2 and the average paper in the database gives
out n = 15 citations, then after any particular paper in that database
has received 15 citations, it will begin to accumulate citations
significantly faster than random: it will have 'tipped over' into the
power-law scaling region. In this region, the power-law exponent,

$$ \gamma = 1 + \frac{1}{c}, \tag{12} $$

is determined by the parameter c. Hence, 'cumulative advantage' arises
in our model because there are more routes (through the reference lists
of other papers) for finding a classic paper than for finding a
non-classic paper.
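The two derived quantities follow directly from the fitted parameters. The short sketch below (the helper name `derived_parameters` is hypothetical) evaluates the tipping point a from Equation 6 and the exponent from Equation 12, first for the worked example in the text and then for the central values of c and n listed in Table 1.

```python
def derived_parameters(c, n):
    """Return (a, gamma): tipping point (Eqs. 6 and 11) and tail exponent (Eq. 12)."""
    a = n * (1.0 - c) / c          # same as n/c - n
    gamma = 1.0 + 1.0 / c
    return a, gamma


if __name__ == "__main__":
    print("worked example:", derived_parameters(0.5, 15))
    # Central values of (c, n) from Table 1; uncertainties are ignored here.
    table_1 = {
        "All 1981 publications": (0.454, 17.3),
        "High h-index chemists": (0.517, 42.0),
        "Phys. Rev. D publications": (0.48, 27.0),
    }
    for name, (c, n) in table_1.items():
        a, gamma = derived_parameters(c, n)
        print(f"{name}: a = {a:.1f}, gamma = {gamma:.3f}")
```

Because Table 1 reports rounded values of c and n, the recomputed a and exponent agree with the tabulated ones only to within rounding.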



2. The Datasets 

Figure 1 shows fits to normalized empirical probability distribution
functions (PDFs, the probability of receiving exactly k citations) and
complementary cumulative distribution functions (CDFs, the probability
of receiving at least k citations),
$P(k) = \int_{k}^{\infty} p(k')\,dk'$, for three datasets:

(1) Citations of publications catalogued in the ISI Web of Science
database in 1981 [10]

(2) Citations of publications by authors on a 2007 list of the living
highest h-index chemists [55]

(3) Citations of publications in the Physical Review D journal from
1975-1994 [10]

Datasets 1 and 3 were downloaded from Sidney Redner's website
(footnote 4). We gathered dataset 2 from the ISI Web of Knowledge
(footnote 5) using a Python script. Parameters for these fits are shown
in Table 1, and plots of the datasets and best-fit p(k) distributions
are shown in Figure 1. We also sorted dataset 2 by h-index. Parameters
for different h-index ranges are shown in Table 2, and fits are shown
in Figure 2. The relation between our estimates of $\gamma$ and h is
shown in Figure 3. To obtain estimates and 95% confidence intervals of
c and n, we used Matlab's implementation of the iteratively reweighted
least squares algorithm with bisquare weights [32]. All curve fitting
was applied to the raw (not binned or log-transformed) data.
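For concreteness, the empirical PDF and complementary CDF described above can be built from a plain list of per-paper citation counts. The sketch below is an assumption about the bookkeeping, not the script used for the paper; the function name `empirical_pdf_ccdf` is hypothetical, and for integer k the integral defining P(k) is replaced by a sum over k' >= k.

```python
from collections import Counter


def empirical_pdf_ccdf(citation_counts):
    """Return (pdf, ccdf) lists indexed by k = 0 .. max(citation_counts).

    pdf[k]  = fraction of papers with exactly k citations
    ccdf[k] = fraction of papers with at least k citations
    """
    total = len(citation_counts)
    counts = Counter(citation_counts)
    k_max = max(citation_counts)
    pdf = [counts.get(k, 0) / total for k in range(k_max + 1)]
    ccdf = [0.0] * (k_max + 1)
    at_least = 0
    for k in range(k_max, -1, -1):      # accumulate integer counts from the tail
        at_least += counts.get(k, 0)
        ccdf[k] = at_least / total
    return pdf, ccdf


if __name__ == "__main__":
    pdf, ccdf = empirical_pdf_ccdf([0, 0, 1, 3])
    print(pdf)
    print(ccdf)
```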



3. Results 

Our model has two parameters: n, the average number of citations
given out per paper in the database, and c, the chance of citing from a
paper's reference list. The model power-law exponent is then fixed by
the relationship $\gamma = 1 + 1/c$. Our best fit of dataset 1 gives a
value of n = 17.3 ± 0.3, in approximate agreement with the independent
estimate of 15.01 found for papers published in 1980 [34]. Also, our
predicted value of $\gamma = 3.20 \pm 0.02$ agrees with the best-fit
power-law exponent previously found by Clauset et al., of
$\gamma = 3.16$ [8]. Table 1 shows the best-fit parameter values for
the three different datasets.



4 http://physics.bu.edu/~redner/projects/citation/index.html
5 http://isiwebofknowledge.com

Figure 2. Comparison of the normalized PDFs and CDFs (inset) for chemists with h =
100+ (red) and chemists with h = 50-53 (blue).



We explored the p(k) distributions for small groups of scientists, as
shown in Figure 2. We wanted to test an alternate hypothesis that some
scientists might publish only low-k papers and others might publish
only classic high-k papers. Our limited tests argue against this
hypothesis. Figure 2 indicates that even highly cited scientists have
more low-k papers than high-k papers. One reason is that every
publication in the scientific literature is new for a while, and
requires some time to become highly cited.

Interestingly, the slope of the power-law region differs between the
two groups shown in Figure 2. To examine this difference in more
detail, we parsed dataset 2 by h-index (Table 2). The h-index of a
scientist is defined as the point where h of the scientist's papers
have at least h citations each [31]. That is, h is defined by the
requirement to satisfy the expression $N P(h) = h$, where P(h) is the
fraction of the scientist's N papers that have received at least h
citations. There is no simple analytical relationship between a
scientist's h-index and the parameters of our model.
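The definition of the h-index translates directly into a short computation. The sketch below is a generic illustration (the function name `h_index` is hypothetical and this is not the script used to sort dataset 2): it ranks a scientist's papers by citation count and finds the largest h such that h papers have at least h citations each.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank        # the top `rank` papers all have >= rank citations
        else:
            break
    return h


if __name__ == "__main__":
    print(h_index([10, 8, 5, 4, 3]))
```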

From Table 2, we conclude that c increases with h-index, indicating
that there is a bias towards selecting papers out of a reference list
that were written by scientists who are already very highly cited
(Figure 2). This bias may reflect the tendency of authors who, scanning
a paper's references for further information, are more likely to select
a paper written by an author they have previously heard of. The more
highly cited the scientist, the lower his or her power-law exponent
(i.e., the fatter the tail); see Figure 3. The error bars are
sufficiently small to indicate that these trends are real, and that
there is not a single universal exponent, such as $\gamma = 3$; rather,
the exponent depends on the subset of scientists examined. Note that,
here, we consider a scientist to have authored a paper if his or her
name appears anywhere in the list of authors. An interesting question
for future work might be to examine whether this effect is changed by
only considering the h-index of each paper's leading and/or
corresponding author.

Our model bears some resemblance to Price's application of CA to
scientific citations [14]. One key difference is that our two
parameters both have physical meaning. To avoid the issue of new papers
having a citation probability of zero when k = 0, Price proposed that
the citation probability should be proportional instead to k + w, where
w is a constant that he refers to as a 'fudge factor.' He sets w = 1,
although as later noted by Newman, there does not seem to be a good
reason to choose this value [9]. The connection rule for our model is
given by Equation 3, and suggests a simple interpretation: Price's
constant arises from random connections, and the tipping point,
Equation 11, is determined by the average size of the reference lists
given out per paper, and the probability of searching through those
reference lists.
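As a worked step, the correspondence with Price's rule can be made explicit by factoring the connection rule. This is a rearrangement of Equation 3 using the definition of a in Equation 6, written out here for illustration rather than quoted from the original text:

```latex
R(k) = \frac{n(1-c)}{N} + \frac{kc}{N}
     = \frac{c}{N}\left[k + \frac{n(1-c)}{c}\right]
     = \frac{c}{N}\,(k + a).
```

Read this way, the citation rate is proportional to k + a, so the tipping point a plays the role of Price's additive constant w, and it is set by n and c rather than chosen by hand.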

This two-mechanism model also provides a justification for a CA
mechanism. Barabási and Albert remarked that CA only produced a
power-law distribution when the connection probability was linearly
proportional to k [18], but it was not clear what was special about
linearity. The present model presents a possible explanation for the
existence of this mechanism, and why the k dependence should be linear:
k appears in $r_{\mathrm{indirect}}$ because a paper's k incoming
citations are represented by k nearest-neighbor links on the graph.

Figure 3. Power-law exponent $\gamma$ plotted against h-index for subsets of dataset 2.



4. Conclusion 

We have developed a model of scientific citations, involving both
direct and indirect routes to finding and citing papers. This
two-mechanism model predicts exponential behavior in the small-k region
and power-law tails in the large-k region. One parameter of the model,
n, is the average number of citations given out per paper. Our best-fit
value of n is consistent with an independent, empirical measure of it
made by Biglu [34]. Our other parameter, c, defines the power-law
exponent, $\gamma = 1 + 1/c$, which is in agreement with data
previously evaluated in [8]. Two key findings here are: (1) the tipping
point for a paper to reach 'classic-paper' status, i.e. its power-law
citation region, is about 21 citations for the ISI Web of Science
database, and (2) the power-law exponent is not a universal feature of
all scientific citations. The exponent diminishes systematically with
increasing h-index of a scientist. Our model describes systems that are
governed by random choices in the small-k region, cumulative advantage
in the high-k region, and a tipping point between them.

Table 2. Fitting parameters for h-index ranges within dataset 2

h range    c                n             $\gamma$       a             N
100+       0.57 ± 0.01      80 ± 3        2.77 ± 0.05    60 ± 2        11029
90-99      0.54 ± 0.01      77 ± 3        2.86 ± 0.05    66 ± 3        11476
80-89      0.53 ± 0.01      60 ± 2        2.89 ± 0.04    53 ± 2        15408
70-79      0.513 ± 0.003    40.6 ± 0.4    2.95 ± 0.01    38.5 ± 0.4    54236
60-69      0.494 ± 0.002    48.7 ± 0.4    3.02 ± 0.01    49.9 ± 0.5    56052
54-59      0.493 ± 0.003    34.9 ± 0.3    3.03 ± 0.01    35.9 ± 0.4    44715
50-53      0.489 ± 0.003    31.3 ± 0.3    3.04 ± 0.01    32.7 ± 0.4    46421